{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--BOOK_INFORMATION-->\n",
    "<a href=\"https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv\" target=\"_blank\"><img align=\"left\" src=\"data/cover.jpg\" style=\"width: 76px; height: 100px; background: white; padding: 1px; border: 1px solid black; margin-right:10px;\"></a>\n",
    "*This notebook contains an excerpt from the book [Machine Learning for OpenCV](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv) by Michael Beyeler.\n",
    "The code is released under the [MIT license](https://opensource.org/licenses/MIT),\n",
    "and is available on [GitHub](https://github.com/mbeyeler/opencv-machine-learning).*\n",
    "\n",
    "*Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations.\n",
    "If you find this content useful, please consider supporting the work by\n",
    "[buying the book](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv)!*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--NAVIGATION-->\n",
    "< [Tuning Hyperparameters with Grid Search](11.03-Tuning-Hyperparameters-with-Grid-Search.ipynb) | [Contents](../README.md) | [Wrapping Up](12.00-Wrapping-Up.ipynb) >"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chaining Algorithms Together to Form a Pipeline\n",
    "\n",
    "Most machine learning problems we have discussed so far consist of at least a\n",
    "preprocessing step and a classification step. The more complicated the problem, the longer\n",
    "this **processing chain** might get. One convenient way to glue multiple processing steps\n",
    "together and even use them in grid search is by using the `Pipeline` class from scikit-learn."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Implementing pipelines in scikit-learn\n",
    "\n",
    "The Pipeline class itself has a `fit`, a `predict`, and a `score` method, which behave just\n",
    "like any other estimator in scikit-learn. The most common used case of the `Pipeline` class\n",
    "is to chain different preprocessing steps together with a supervised model like a classifier.\n",
    "\n",
    "Let's return to the breast cancer dataset from [Chapter 5](05.00-Using-Decision-Trees-to-Make-a-Medical-Diagnosis.ipynb), *Using Decision Trees to Make a\n",
    "Medical Diagnosis*. Using scikit-learn, we import the dataset and split it into training and test\n",
    "sets:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_breast_cancer\n",
    "import numpy as np\n",
    "cancer = load_breast_cancer()\n",
    "X = cancer.data.astype(np.float32)\n",
    "y = cancer.target"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, random_state=37\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instead of the $k$-NN algorithm, we could fit a support vector machine (SVM) to the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n",
       "  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',\n",
       "  max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
       "  tol=0.001, verbose=False)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.svm import SVC\n",
    "svm = SVC()\n",
    "svm.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Without straining our brains too hard, this algorithm achieves an accuracy score of 65%:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.65034965034965031"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "svm.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now if we wanted to run the algorithm again using some preprocessing step (for example,\n",
    "by scaling the data first with `MinMaxScaler`),we would do the preprocessing step by hand\n",
    "and then feed the preprocessed data into the classifiers `fit` method.\n",
    "\n",
    "An alternative is to use a pipeline object. Here, we want to specify a list of processing steps,\n",
    "where each step is a tuple containing a name (any string of our choosing) and an instance of\n",
    "an estimator:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "pipe = Pipeline([(\"scaler\", MinMaxScaler()), (\"svm\", SVC())])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we created two steps: the first, called `\"scaler\"`, is an instance of `MinMaxScaler`, and\n",
    "the second, called `\"svm\"`, is an instance of `SVC`. Now we can fit the pipeline like any other\n",
    "scikit-learn estimator:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Pipeline(steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('svm', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n",
       "  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',\n",
       "  max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
       "  tol=0.001, verbose=False))])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipe.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, the `fit` method first calls `fit` on the first step (the scaler), then it transforms the\n",
    "training data using the scaler, and finally it fits the SVM with the scaled data.\n",
    "\n",
    "And voila! When we score the classifier on the test data, we see a drastic improvement in\n",
    "performance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.95104895104895104"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipe.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calling the score method on the pipeline first transforms the test data using the scaler and\n",
    "then calls the score method on the SVM using the scaled test data. And scikit-learn did all\n",
    "this with only four lines of code!\n",
    "\n",
    "The main benefit of using the pipeline, however, is that we can now use this single\n",
    "estimator in `cross_val_score` or `GridSearchCV`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using pipelines in grid searches\n",
    "\n",
    "Using a pipeline in a grid search works the same way as using any other estimator.\n",
    "\n",
    "We define a parameter grid to search over and construct a `GridSearchCV` from the pipeline\n",
    "and the parameter grid. When specifying the parameter grid, there is, however, a slight\n",
    "change. We need to specify for each parameter which step of the pipeline it belongs to. Both\n",
    "parameters that we want to adjust, `C` and `gamma`, are parameters of SVC, the second step. In\n",
    "the preceding section, we gave this step the name `\"svm\"`. The syntax to define a parameter\n",
    "grid for a pipeline is to specify for each parameter the step name, followed by `__` (a double\n",
    "underscore), followed by the parameter name.\n",
    "\n",
    "Hence, we would construct the parameter grid as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],\n",
    "              'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With this parameter grid, we can use `GridSearchCV` as usual:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=10, error_score='raise',\n",
       "       estimator=Pipeline(steps=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('svm', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n",
       "  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',\n",
       "  max_iter=-1, probability=False, random_state=None, shrinking=True,\n",
       "  tol=0.001, verbose=False))]),\n",
       "       fit_params={}, iid=True, n_jobs=1,\n",
       "       param_grid={'svm__C': [0.001, 0.01, 0.1, 1, 10, 100], 'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]},\n",
       "       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,\n",
       "       scoring=None, verbose=0)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.model_selection import GridSearchCV\n",
    "grid = GridSearchCV(pipe, param_grid=param_grid, cv=10)\n",
    "grid.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The best score in the grid is stored in `best_score_`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.97652582159624413"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grid.best_score_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly, the best parameters are stored in `best_params_`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'svm__C': 1, 'svm__gamma': 1}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grid.best_params_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But recall that the cross-validation score might be overly optimistic. In order to know the\n",
    "true performance of the classifier, we need to score it on the test set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.965034965034965"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grid.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In contrast to the grid search we did before, now for each split in the cross-validation,\n",
    "`MinMaxScaler` is refit with only the training splits, and no information is leaked from the\n",
    "test split into the parameter search.\n",
    "\n",
    "This makes it easy to build a pipeline to chain together a whole variety of steps!\n",
    "\n",
    "How would you mix and match different estimators in a single pipeline? Turn to page 330 to find the answer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--NAVIGATION-->\n",
    "< [Tuning Hyperparameters with Grid Search](11.03-Tuning-Hyperparameters-with-Grid-Search.ipynb) | [Contents](../README.md) | [Wrapping Up](12.00-Wrapping-Up.ipynb) >"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
